title: "Replacement Class - Introduction to spatial data"
date: "2023-11-13"
author: "Gianna Capoccia"

This analysis walks through the following steps:

- Libraries: the code begins by loading several libraries: knitr for dynamic report generation, tidyverse for data manipulation and visualization, janitor for cleaning the data, lubridate for working with dates, here for better file path management, and mapview, gbfs, sf, tmap, and tidycensus for working with geospatial data, mapping, and accessing US Census Bureau data.
- Loading geospatial data: the code loads a GeoJSON file containing neighborhood geospatial data for Washington, D.C. This data is stored in the object neigh.
- Loading and cleaning COVID-19 data: the code loads a CSV file of COVID-19 case data, cleans the column names, filters to a single date, and separates the 'neighborhood' column into 'code' and 'name' columns.
- Joining data: the cleaned COVID-19 data (df_cases) is joined with the neighborhood geospatial data (neigh), creating a new data frame, neigh2.
- Visualizing data: the joined data is visualized as a choropleth map using the tmap library.
- Loading Census data: the code uses the tidycensus library to load further demographic data from the American Community Survey, including median income, total population, and black population for each tract in D.C.
- Joining geospatial and Census data: the neighborhood data is joined with the Census data, taking care to aggregate the data correctly. The result is a new data frame, df2.
- Improved visualization: the joined data is visualized in choropleth maps showing median income, COVID-19 case rate, and percentage of black population by neighborhood.
- Loading and visualizing bike ride data: the code loads a CSV file of bike rides, converts the data to a spatial format, and intersects it with the neighborhood data to count the number of bike rides starting in each neighborhood. The counts are visualized on a map.

In a broader perspective, this analysis is an example of data science applied to public health and urban planning.
It uses geospatial analysis to understand the distribution of COVID-19 cases in relation to demographic and urban mobility patterns. This type of analysis can help policymakers understand the impact of the pandemic in different neighborhoods and inform targeted interventions or resource allocation. By incorporating bike ride data, the analysis could also provide insights into how mobility patterns relate to COVID-19 spread or how the pandemic has affected urban mobility.

Key topics:

Packages

Standards:

library(knitr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate) # because we will probably see some dates
library(here) # more easily access files in your project
## here() starts at /Users/giannacapoccia/Downloads
library(mapview)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, were retired in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
library(gbfs)

Some additional packages focus on today's work:

library(sf) # working with simple features - geospatial
## Linking to GEOS 3.10.2, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(tmap)
## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')
library(tidycensus)

A link to a book on tmap: https://r-tmap.github.io/

Using the Neighborhood Geospatial Data (using /data_raw)

I will use the GeoJSON file. (Newer, not necessarily better, but … a single file. Not smaller, but … this one is not big.)

https://opendata.dc.gov/datasets/DCGIS::dc-health-planning-neighborhoods/about

Load the neighborhood geospatial data as neigh.

neigh=st_read(here("DS241_Portfolio/data_raw", "DC_Health_Planning_Neighborhoods.geojson")) %>% clean_names()
## Reading layer `DC_Health_Planning_Neighborhoods' from data source 
##   `/Users/giannacapoccia/Downloads/DS241_Portfolio/data_raw/DC_Health_Planning_Neighborhoods.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 51 features and 8 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -77.11976 ymin: 38.79165 xmax: -76.9094 ymax: 38.99556
## Geodetic CRS:  WGS 84
class(neigh)
## [1] "sf"         "data.frame"

Ensure it plots.

plot(neigh)

Investigating joining spatial and non-spatial data

Download the DC COVID-19 positive-cases data and store it at an appropriate place in your project.

Read the data as df_c and be sure to clean names.

df_c<-read.csv(here("DS241_Portfolio/data_raw","DC_COVID-19_Total_Positive_Cases_by_Neighborhood.csv")) %>% clean_names()

Now - let’s focus on a particular date (for no reason other than simplifying our analysis).

df_cases=df_c %>%
   filter(as_date(date_reported) == "2021-11-17") %>% 
  separate(neighborhood,into=c("code","name"),sep = ":")

Now refine df_cases, recoding the "N35" code to "N0" and dropping the reported date:

df_cases=df_c %>%
  filter(as_date(date_reported) == "2021-11-17") %>% 
  separate(neighborhood,into=c("code","name"),sep = ":") %>%
  mutate(code=case_when(code=="N35" ~"N0",
                        TRUE ~ code)) %>%
  select(-date_reported)

Regular joining (of dataframes)

Join the dataframes and make a choropleth map using tmap.

neigh2=left_join(neigh,df_cases,by=c("code"))

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(neigh2) +tm_polygons("total_positives",alpha=0.5)

Joining with other spatial data

Let's get some data using tidycensus. You will need an API key: https://api.census.gov/data/key_signup.html

# census_api_key("YOUR_API_KEY", install = TRUE)

What data is available, and what are the variable names?

(We are interested in the 5-year American Community Survey.)

v20 = load_variables(2018, "acs5")
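You can search the variable table for the codes used below. Since v20 requires a download, this sketch uses a hypothetical toy subset of the table (the concept strings are approximate, not the exact ACS labels):

```r
library(dplyr)
library(stringr)

# Toy subset mimicking the columns of load_variables() output
v20_toy <- tibble::tribble(
  ~name,        ~label,                           ~concept,
  "B06011_001", "Estimate!!Median income!!Total", "MEDIAN INCOME IN THE PAST 12 MONTHS",
  "B01001_001", "Estimate!!Total",                "SEX BY AGE",
  "B02009_001", "Estimate!!Total",                "BLACK OR AFRICAN AMERICAN ALONE OR IN COMBINATION"
)

# Filter the table for income-related variables
v20_toy %>% filter(str_detect(concept, "INCOME"))
```

The same filter/str_detect pattern works on the real v20 table.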

Get some data:

df_cencus=get_acs(geography = "tract",
                  variables=c("median_inc"="B06011_001",
                              "pop"="B01001_001",
                              "pop_black"="B02009_001"),
                  state="DC",geometry=TRUE,year=2021) 
## Getting data from the 2017-2021 5-year ACS
## Warning: • You have not set a Census API key. Users without a key are limited to 500
## queries per day and may experience performance limitations.
## ℹ For best results, get a Census API key at
## http://api.census.gov/data/key_signup.html and then supply the key to the
## `census_api_key()` function to use it throughout your tidycensus session.
## This warning is displayed once per session.
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.

Make a plot to verify that you read the data:

class(df_cencus)
## [1] "sf"         "data.frame"
plot(df_cencus)

A better visualization

It’s in long format. Let’s make it wide.

df_cens=df_cencus %>% 
  select(-moe) %>% 
  pivot_wider(names_from = "variable", 
              values_from = "estimate")
  
 

tm_shape(df_cens) +tm_polygons("median_inc",alpha=.5)

How to join

Consider this problem:

  tm_shape(neigh2) +tm_borders(col="blue",lwd=5,alpha=.2)+
  tm_shape(df_cens) +tm_borders(col="red",lwd=1,alpha=.3)

OK, let's walk through the tricky parts of the code:

CRS: the layers need to share the same coordinate reference system. Add one to the census data by transforming it to WGS 84 (EPSG:4326):

df_cens_adj=df_cens %>% st_transform(4326)
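A quick way to confirm two layers share a CRS before joining; a sketch with a made-up point, not the class data:

```r
library(sf)

# A single point in WGS 84 (EPSG:4326), like the neighborhood layer
pt <- st_sfc(st_point(c(-77.0, 38.9)), crs = 4326)

# The same point projected to UTM zone 18N (covers DC)
pt_utm <- st_transform(pt, 32618)

st_crs(pt) == st_crs(pt_utm)                      # FALSE: a spatial join would error
st_crs(st_transform(pt_utm, 4326)) == st_crs(pt)  # TRUE after transforming back
```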

But which way do we join? And think about how the join should aggregate the data.

df_j=st_join(df_cens_adj,neigh2,largest=TRUE)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

Other order?

df_j_rev = st_join(neigh2,df_cens_adj,largest = TRUE)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
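The direction of st_join() decides which geometry survives, and largest = TRUE decides how multiple matches are resolved. A toy example with made-up squares (not the class data):

```r
library(sf)

# One "neighborhood" square containing two "tracts"
nb <- st_sf(code = "N1",
            geometry = st_sfc(st_polygon(list(rbind(
              c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0))))))
tr <- st_sf(pop = c(10, 20),
            geometry = st_sfc(
              st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 2), c(0, 2), c(0, 0)))),
              st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 2), c(1, 2), c(1, 0))))))

nrow(st_join(tr, nb))                  # 2: tract geometry kept, one row per tract
nrow(st_join(nb, tr))                  # 2: the neighborhood row is duplicated per matching tract
nrow(st_join(nb, tr, largest = TRUE))  # 1: keep only the tract with the largest overlap
```

So joining tract-onto-neighborhood duplicates neighborhood rows, which is why the aggregation step below is needed.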

Since we want the geometry for the NEIGHBORHOODS, we need to work a little harder:

df1=df_j %>% select(median_inc,pop,pop_black,code) %>%
  group_by(code) %>%
  summarise(pop_n=sum(pop),
            pop_black_n=sum(pop_black), 
            adj_median_income=sum(pop*median_inc)/pop_n) # Population-weighted mean of the tract median incomes

plot(df1)
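A quick toy calculation shows why the population weighting matters (the numbers are made up, not from the class data):

```r
# Two tracts in one neighborhood
pop <- c(1000, 9000)        # tract populations
inc <- c(100000, 30000)     # tract median incomes

mean(inc)                   # naive mean: 65000
sum(pop * inc) / sum(pop)   # population-weighted mean: 37000
weighted.mean(inc, pop)     # base R gives the same result
```

The naive mean lets a tiny rich tract dominate; weighting by population reflects the typical resident.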

Now that we are aggregating in the right way, we can join.

# df2=left_join(neigh2,df1)

df2=left_join(neigh2,df1 %>% st_set_geometry(NULL))
## Joining with `by = join_by(code)`

And visualize:

df2=df2 %>% mutate(black_perc=pop_black_n/pop_n, covid_rate=total_positives/pop_n)
tm_shape(df2)+tm_polygons(c("adj_median_income","covid_rate","black_perc"))
class(df2)
## [1] "sf"         "data.frame"

Improve that visualization:

df2 %>% filter(!code == "N0") %>%
  tm_shape()+tm_polygons(c("adj_median_income","covid_rate","black_perc"), alpha =.4)

To draw causal conclusions from data, you need to run an experiment and control the statistical treatments.

Loading the Bike Ride Data

bikes <- read.csv(here("DS241_Portfolio/data_raw","202309-capitalbikeshare-tripdata.csv")) %>% clean_names()
df1_s = bikes %>%
  slice_head(n=1000)
df1_sf <- st_as_sf(df1_s,coords = c("start_lng","start_lat"), crs = 4326)

class(df1_sf)
## [1] "sf"         "data.frame"
mapview(df1_sf)
inter <- st_intersects(df2, df1_sf)  # for each neighborhood, which ride start points fall inside it
df2$count <- lengths(inter)          # number of rides starting in each neighborhood

class(df2$count)
## [1] "integer"
class(inter)
## [1] "sgbp" "list"
ggplot(df2) + geom_sf(aes(fill = count))

df1_s_sf <- read_sf(here("DS241_Portfolio/data_raw", "202309-capitalbikeshare-tripdata.csv"))
glimpse(df1_s_sf)
## Rows: 450,090
## Columns: 13
## $ ride_id            <chr> "78909121C1F6C62C", "E3573C008D481AD0", "68611BE6E8…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2023-09-07 17:17:58", "2023-09-07 12:39:20", "2023…
## $ ended_at           <chr> "2023-09-07 17:18:51", "2023-09-07 12:49:32", "2023…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <chr> "38.92", "38.88", "38.9", "38.97", "38.97", "38.93"…
## $ start_lng          <chr> "-77.03", "-77.02", "-77.02", "-77.02", "-77.02", "…
## $ end_lat            <chr> "38.92", "38.89", "38.9", "38.97", "38.91", "38.92"…
## $ end_lng            <chr> "-77.03", "-77.01", "-77.01", "-77.02", "-77.03", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
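Note from the glimpse() output that read_sf() on a plain CSV imports every column as character (start_lat is &lt;chr&gt;), so the coordinates need converting before st_as_sf(). A sketch on a made-up one-row tibble:

```r
library(dplyr)
library(sf)

# Toy stand-in for one row of the character-typed CSV import
toy <- tibble::tibble(start_lat = "38.92", start_lng = "-77.03")

toy_sf <- toy %>%
  mutate(across(c(start_lat, start_lng), as.numeric)) %>%  # chr -> numeric first
  st_as_sf(coords = c("start_lng", "start_lat"), crs = 4326)

st_geometry_type(toy_sf)  # POINT
```

This is why the earlier chunk used read.csv() (which parses the coordinates as numeric) before calling st_as_sf().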

Conclusions: This R code introduces several key topics in data science, particularly geospatial data manipulation and visualization. Here's a breakdown of the tasks:

1. Loading libraries: the code starts by loading the required R libraries, including libraries for data manipulation (tidyverse, janitor, lubridate), file path management (here), and libraries for working with geospatial data and mapping (sf, tmap, tidycensus, mapview, and gbfs).
2. Reading geospatial data: the code reads a GeoJSON file containing neighborhood geospatial data using the st_read() function from the sf package. The file path is generated with here(), which creates file paths that work across different operating systems.
3. Visualizing geospatial data: the geospatial data is plotted with plot() to ensure that it has been loaded correctly.
4. Data wrangling: the code reads a CSV file of COVID-19 case data and filters it to a specific date, then separates the "neighborhood" column into "code" and "name" using separate() from the tidyr package.
5. Joining spatial and non-spatial data: the code performs a left join of the neighborhood geospatial data (neigh) and the COVID-19 case data (df_cases) using left_join() from dplyr, then plots the joined data with tmap as a choropleth map.
6. Retrieving and joining spatial Census data: the tidycensus package retrieves census data, including median income, population, and black population for Census tracts in Washington, DC, which is then joined with the neighborhood geospatial data.
7. Reshaping data: the census data is reshaped from long to wide format with pivot_wider(), which makes it easier to visualize.
8. Spatial joining: the coordinate reference system of the census data is first transformed to match that of the neighborhood data; spatial joins are then performed in both orders and the data is aggregated appropriately.
9. Visualization: a series of tmap choropleth maps of adjusted median income, COVID-19 rates, and the percentage of black residents in different neighborhoods.
10. Loading and mapping bike ride data: a CSV of bike rides is converted to a simple features object (st_as_sf()), intersected with the neighborhood data (st_intersects()), and the result visualized.

In conclusion, this R code focuses on handling geospatial data: reading and visualizing it, joining it with non-spatial data, retrieving and working with spatial Census data, and performing spatial joins, with visualization throughout via the tmap package.